- Notifications
You must be signed in to change notification settings - Fork 849
/
Copy pathOML4Py Importing Wide Datasets.dsnb
executable file
·1 lines (1 loc) · 20.3 KB
/
OML4Py Importing Wide Datasets.dsnb
1
[{"layout":null,"template":null,"templateConfig":null,"name":"OML4Py Importing Wide Datasets","description":null,"readOnly":false,"type":"medium","paragraphs":[{"col":0,"visualizationConfig":null,"hideInIFrame":false,"selectedVisualization":null,"title":null,"message":["%md"," "],"enabled":true,"result":null,"sizeX":0,"hideCode":true,"width":12,"hideResult":true,"dynamicFormParams":null,"row":0,"hasTitle":false,"hideVizConfig":true,"hideGutter":true,"relations":[],"forms":"[]"},{"col":0,"visualizationConfig":null,"hideInIFrame":false,"selectedVisualization":"html","title":null,"message":["%md","","# Importing Wide Datasets using OML4Py","","Oracle Database has a limit of 1,000 columns in a table or view. Machine learning projects may involve datasets, perhaps stored as CSV files, with a higher number of columns. OML4SQL in-database algorithms can handle such wide datasets by converting the columns to a special format known as \"nested columns\". This allows an efficient handling of high dimensional and sparse features, as well as accomodating more than 1000 predictor columns, when building OML machine learning models. However, importing a dataset with the more than 1,000 columns from a data file, such as a CSV file in Object Storage, involves some preprocessing. ","","In this notebook, we demonstrate the process of importing a wide dataset with 2,002 columns from CSV format into an Oracle Database table with nested columns. ","","The approach is to read the file by chunks, convert each chunk to a transactional format into a database table, and then pivot the data into the nested column format using SQL. Note that the nested column format can be used only through OML4SQL, not OML4Py. ","","To improve the efficiency of the CSV loading, the \"index apply\" functionality of OML4Py is also used to speed up the process and make it parallel.","","Copyright (c) 2024 Oracle Corporation ","###### [The Universal Permissive License (UPL), Version 1.0](https://oss.oracle.com/licenses/upl/)","---"],"enabled":true,"result":null,"sizeX":0,"hideCode":true,"width":12,"hideResult":false,"dynamicFormParams":null,"row":0,"hasTitle":false,"hideVizConfig":true,"hideGutter":true,"relations":[],"forms":"[]"},{"col":0,"visualizationConfig":null,"hideInIFrame":false,"selectedVisualization":"html","title":"For more information...","message":["%md","","* <a href=\"https://docs.oracle.com/en/cloud/paas/autonomous-data-warehouse-cloud/index.html\" target=\"_blank\">Oracle ADB documentation<\/a>","* <a href=\"https://github.com/oracle-samples/oracle-db-examples/tree/main/machine-learning\" target=\"_blank\">OML folder on Oracle GitHub<\/a>","* <a href=\"https://www.oracle.com/machine-learning\" target=\"_blank\">OML website<\/a>","* <a href=\"https://docs.oracle.com/en/database/oracle/oracle-database/21/adobj/multilevel-collection-types.html\" target=\"_blank\">Oracle nested columns table documentation<\/a>","* <a href=\"https://docs.oracle.com/en/database/oracle/machine-learning/oml4sql/21/dmprg/using-nested-dita.html\" target=\"_blank\">OML using nested data documentation<\/a>","* <a href=\"https://docs.oracle.com/en/database/oracle/machine-learning/oml4py/1/mlpug/embedded-python-execution1.html\" target=\"_blank\">OML4Py embedded Python execution documentation<\/a>"],"enabled":true,"result":null,"sizeX":0,"hideCode":true,"width":12,"hideResult":false,"dynamicFormParams":null,"row":0,"hasTitle":true,"hideVizConfig":true,"hideGutter":true,"relations":[],"forms":"[]"},{"col":0,"visualizationConfig":null,"hideInIFrame":false,"selectedVisualization":"html","title":"Import Python libraries","message":["%python","","import pandas as pd","import oml"],"enabled":true,"result":null,"sizeX":0,"hideCode":false,"width":12,"hideResult":false,"dynamicFormParams":null,"row":0,"hasTitle":true,"hideVizConfig":false,"hideGutter":true,"relations":[],"forms":"[]"},{"col":0,"visualizationConfig":null,"hideInIFrame":false,"selectedVisualization":"raw","title":"Create a table to hold the data in transactional format","message":["%script ","","BEGIN"," EXECUTE IMMEDIATE 'DROP TABLE TXN_TBL';"," EXCEPTION WHEN OTHERS THEN NULL;","END;","/","","CREATE TABLE TXN_TBL( ID NUMBER, "," NAME VARCHAR2(32),"," VALUE NUMBER"," );"],"enabled":true,"result":null,"sizeX":0,"hideCode":false,"width":6,"hideResult":false,"dynamicFormParams":null,"row":0,"hasTitle":true,"hideVizConfig":false,"hideGutter":true,"relations":[],"forms":"[]"},{"col":0,"visualizationConfig":null,"hideInIFrame":false,"selectedVisualization":"raw","title":"Create a table that holds the label","message":["%script","","BEGIN"," EXECUTE IMMEDIATE 'DROP TABLE WIDE_LABEL_TBL';"," EXCEPTION WHEN OTHERS THEN NULL;","END;","/","","CREATE TABLE WIDE_LABEL_TBL( ID NUMBER, "," CLASS INTEGER"," );","/"],"enabled":true,"result":null,"sizeX":0,"hideCode":false,"width":6,"hideResult":false,"dynamicFormParams":null,"row":0,"hasTitle":true,"hideVizConfig":false,"hideGutter":true,"relations":[],"forms":"[]"},{"col":0,"visualizationConfig":null,"hideInIFrame":false,"selectedVisualization":"html","title":null,"message":["%md","","### Create supporting user-defined Python functions","* Function to insert multiple records into a database table","* Function to read the data in partitions and pivot the data into transactions","* Function for generating the partitions for the data","","---"],"enabled":true,"result":null,"sizeX":0,"hideCode":true,"width":12,"hideResult":false,"dynamicFormParams":null,"row":0,"hasTitle":false,"hideVizConfig":true,"hideGutter":true,"relations":[],"forms":"[]"},{"col":0,"visualizationConfig":null,"hideInIFrame":false,"selectedVisualization":"html","title":"Functions to insert multiple records into the database table","message":["%python","","import oracledb","","# Insert records using the provided cursor, the statement, and the records","","def insert_records(cr, records, statement):"," try: "," cr.executemany(statement, records)"," cr.connection.commit()"," except cx_Oracle.Error as error:"," print('Insert error:')"," print(error)","","# Append the rows in a partition and call the insert_records function","","def insertDF(df, statement):"," cr = oml.cursor()",""," records = []"," for row in df.itertuples(index=False):"," records.append(row)"," insert_records(cr, records, statement)"," cr.close() "],"enabled":true,"result":null,"sizeX":0,"hideCode":false,"width":12,"hideResult":false,"dynamicFormParams":null,"row":0,"hasTitle":true,"hideVizConfig":false,"hideGutter":true,"relations":[],"forms":"[]"},{"col":0,"visualizationConfig":null,"hideInIFrame":false,"selectedVisualization":"html","title":"Function that writes each partition into transactional format","message":["%python","","import gc ","","def import_helper(idx, url, columns, ncol, partitions, chunk_size):"," # Verifies the partition to read"," read_rows = partitions[idx-1][0]"," # Check the largest index of that partition"," total = partitions[idx-1][1] + 1"," "," while read_rows < total:"," offset = read_rows + 1"," # Reads the data from the defined chunks"," df = pd.read_csv(url, skiprows = offset, nrows = chunk_size, header = None)"," read_rows += df.shape[0]"," df.columns = columns"," # Pivots the data"," txn_df = pd.melt(df, id_vars = ['ID'], value_vars = columns[:ncol])"," txn_df.columns = ['ID', 'NAME', 'VALUE']"," print('Rows '+str(min(txn_df['ID']))+' to '+str(max(txn_df['ID']))+"," ': data pivoted')"," # Write the pivotted data to a database table "," statement = \"INSERT INTO TXN_TBL (ID, NAME, VALUE) VALUES(:ID, :NAME, :VALUE)\""," insertDF(txn_df, statement)"," statement = \"INSERT INTO WIDE_LABEL_TBL (ID, CLASS) VALUES(:ID, :CLASS)\""," label_df = df[['ID','CLASS']]"," insertDF(label_df, statement)"," print('Rows '+str(min(txn_df['ID']))+' to '+str(max(txn_df['ID']))+"," ': inserted into TXN_TBL '+str(txn_df.shape[0])+' rows and '+str(txn_df.shape[1])+' columns')"," # Cleanup temporary elements used"," del df"," del txn_df"," del label_df"," gc.collect()"],"enabled":true,"result":null,"sizeX":0,"hideCode":false,"width":6,"hideResult":false,"dynamicFormParams":null,"row":0,"hasTitle":true,"hideVizConfig":false,"hideGutter":true,"relations":[],"forms":"[]"},{"col":0,"visualizationConfig":null,"hideInIFrame":false,"selectedVisualization":"html","title":"Function for generating the partitions for the data","message":["%python","","def generate_part(rows_to_process, num_partitions):"," segment_size = int(rows_to_process/num_partitions)"," result = []"," for i in range(num_partitions):"," result.append([i*segment_size, (i+1)*segment_size-1])"," result[len(result)-1][1] = rows_to_process-1"," return result"],"enabled":true,"result":null,"sizeX":0,"hideCode":false,"width":6,"hideResult":false,"dynamicFormParams":null,"row":0,"hasTitle":true,"hideVizConfig":false,"hideGutter":true,"relations":[],"forms":"[]"},{"col":0,"visualizationConfig":null,"hideInIFrame":false,"selectedVisualization":"html","title":null,"message":["%md","","### Non-parallel processing","","In the following example, we process the data in chunks sequentially. This takes a significant amount of time. Later, we show doing this in parallel using embedded Python execution. ","","---"],"enabled":true,"result":null,"sizeX":0,"hideCode":true,"width":12,"hideResult":false,"dynamicFormParams":null,"row":0,"hasTitle":false,"hideVizConfig":true,"hideGutter":true,"relations":[],"forms":"[]"},{"col":0,"visualizationConfig":null,"hideInIFrame":false,"selectedVisualization":"raw","title":"Check that the table is empty before inserting","message":["%sql","","SELECT COUNT(1) FROM TXN_TBL "],"enabled":true,"result":null,"sizeX":0,"hideCode":false,"width":5,"hideResult":false,"dynamicFormParams":null,"row":0,"hasTitle":true,"hideVizConfig":false,"hideGutter":true,"relations":[],"forms":"[]"},{"col":0,"visualizationConfig":null,"hideInIFrame":false,"selectedVisualization":"raw","title":"Create separate partition slices to take advantage of parallelism ","message":["%python","","# Define the total number or rows to process from the dataset","# to a fixed number for this example","rows_to_process = 2000","","# Specify the number of partitions to create","num_partitions = 5","","partitions = generate_part(rows_to_process, num_partitions)","print(\"Generated partitions:\\n\")","print(partitions)"],"enabled":true,"result":null,"sizeX":0,"hideCode":false,"width":7,"hideResult":false,"dynamicFormParams":null,"row":0,"hasTitle":true,"hideVizConfig":false,"hideGutter":true,"relations":[],"forms":"[]"},{"col":0,"visualizationConfig":"[{\"raw\":{\"height\":300,\"lastColumns\":[],\"version\":1}}]","hideInIFrame":false,"selectedVisualization":"raw","title":"Sequential run processing the first 2000 rows","message":["%python","","import ssl","","ssl._create_default_https_context = ssl._create_unverified_context","","","url = 'https://objectstorage.us-ashburn-1.oraclecloud.com/n/adwc4pm/b/OML_Data/o/wide.csv'","","df = pd.read_csv(url, nrows = 0, skiprows = 0)","columns = list(df.columns)","","# Number of numerical columns in the dataset to be processed","ncol = 2000","# Set the number of records to process at a time (per chunk)","chunk_size = 100","","for i in range(1,6):"," import_helper(i, url, columns, ncol, partitions, chunk_size)"],"enabled":true,"result":null,"sizeX":0,"hideCode":false,"width":6,"hideResult":false,"dynamicFormParams":null,"row":0,"hasTitle":true,"hideVizConfig":false,"hideGutter":true,"relations":[],"forms":"[]"},{"col":0,"visualizationConfig":null,"hideInIFrame":false,"selectedVisualization":"table","title":"Show a sample of the resulting transactions","message":["%sql","","select * from TXN_TBL FETCH FIRST 10 ROWS ONLY"],"enabled":true,"result":null,"sizeX":0,"hideCode":false,"width":6,"hideResult":false,"dynamicFormParams":null,"row":0,"hasTitle":true,"hideVizConfig":false,"hideGutter":true,"relations":[],"forms":"[]"},{"col":0,"visualizationConfig":null,"hideInIFrame":false,"selectedVisualization":"html","title":null,"message":["%md","","### Parallel processing using Embedded Python Execution","","The performance improvement is relative to the degree of parallelism enabled. Depending on your Autonomous Database instances CPU allocation, you can explicitly choose the number of Python engines to spawn in parallel. ","","---"],"enabled":true,"result":null,"sizeX":0,"hideCode":true,"width":12,"hideResult":false,"dynamicFormParams":null,"row":0,"hasTitle":false,"hideVizConfig":true,"hideGutter":true,"relations":[],"forms":"[]"},{"col":0,"visualizationConfig":null,"hideInIFrame":false,"selectedVisualization":"raw","title":"Remove all data before parallel run","message":["%script","","DELETE FROM WIDE_LABEL_TBL;","DELETE FROM TXN_TBL;"],"enabled":true,"result":null,"sizeX":0,"hideCode":false,"width":4,"hideResult":false,"dynamicFormParams":null,"row":0,"hasTitle":true,"hideVizConfig":false,"hideGutter":true,"relations":[],"forms":"[]"},{"col":0,"visualizationConfig":"[{\"raw\":{\"height\":300,\"lastColumns\":[],\"version\":1}}]","hideInIFrame":false,"selectedVisualization":"raw","title":"Run in parallel using oml.index_apply","message":["%python","","url = 'https://objectstorage.us-ashburn-1.oraclecloud.com/n/adwc4pm/b/OML_Data/o/wide.csv'","","df = pd.read_csv(url, nrows = 0, skiprows = 0)","columns = list(df.columns)","","ncol = 2000","chunk_size = 100","_ = oml.index_apply(times = 5, func = import_helper, url = url, columns = columns, "," ncol = ncol, partitions = partitions, chunk_size = 100, "," graphics = False, parallel = True)"],"enabled":true,"result":null,"sizeX":0,"hideCode":false,"width":8,"hideResult":false,"dynamicFormParams":null,"row":0,"hasTitle":true,"hideVizConfig":false,"hideGutter":true,"relations":[],"forms":"[]"},{"col":0,"visualizationConfig":null,"hideInIFrame":false,"selectedVisualization":"raw","title":"Check the total data loaded","message":["%sql","","select count(1)","from txn_tbl"],"enabled":true,"result":null,"sizeX":0,"hideCode":false,"width":6,"hideResult":false,"dynamicFormParams":null,"row":0,"hasTitle":true,"hideVizConfig":false,"hideGutter":true,"relations":[],"forms":"[]"},{"col":0,"visualizationConfig":null,"hideInIFrame":false,"selectedVisualization":"table","title":"Check the table with id = 0","message":["%sql","","SELECT *","FROM txn_tbl","WHERE id = 0","FETCH FIRST 10 ROWS ONLY"],"enabled":true,"result":null,"sizeX":0,"hideCode":false,"width":6,"hideResult":false,"dynamicFormParams":null,"row":0,"hasTitle":true,"hideVizConfig":false,"hideGutter":true,"relations":[],"forms":"[]"},{"col":0,"visualizationConfig":null,"hideInIFrame":false,"selectedVisualization":"raw","title":"Check if there are any duplicates","message":["%sql","","SELECT count(distinct name)","FROM TXN_TBL","where id = 0"],"enabled":true,"result":null,"sizeX":0,"hideCode":false,"width":6,"hideResult":false,"dynamicFormParams":null,"row":0,"hasTitle":true,"hideVizConfig":false,"hideGutter":true,"relations":[],"forms":"[]"},{"col":0,"visualizationConfig":null,"hideInIFrame":false,"selectedVisualization":"table","title":"Show the label/class and the id","message":["%sql","","SELECT * FROM WIDE_LABEL_TBL FETCH FIRST 10 ROWS ONLY"],"enabled":true,"result":null,"sizeX":0,"hideCode":false,"width":6,"hideResult":false,"dynamicFormParams":null,"row":0,"hasTitle":true,"hideVizConfig":false,"hideGutter":true,"relations":[],"forms":"[]"},{"col":0,"visualizationConfig":null,"hideInIFrame":false,"selectedVisualization":"html","title":null,"message":["%md","","### Create the nested column from the transactional data"],"enabled":true,"result":null,"sizeX":0,"hideCode":true,"width":12,"hideResult":false,"dynamicFormParams":null,"row":0,"hasTitle":false,"hideVizConfig":true,"hideGutter":true,"relations":[],"forms":"[]"},{"col":0,"visualizationConfig":null,"hideInIFrame":false,"selectedVisualization":"raw","title":"Create the nested column table and join the labels","message":["%script","","","BEGIN EXECUTE IMMEDIATE 'DROP TABLE TXN_NESTED_COL_NUM';","EXCEPTION WHEN OTHERS THEN NULL; END;","/","","CREATE TABLE TXN_NESTED_COL_NUM"," NESTED TABLE txn_nested store AS txn_nested_tab AS ","SELECT m.id, m.CLASS, txn_nested"," FROM WIDE_LABEL_TBL m,"," ( SELECT features.id,"," cast(COLLECT(dm_nested_numerical(features.name,features.value)) "," as dm_nested_numericals) txn_nested"," FROM TXN_TBL features"," group by features.id) n"," WHERE m.id = n.id ","/"],"enabled":true,"result":null,"sizeX":0,"hideCode":false,"width":6,"hideResult":false,"dynamicFormParams":null,"row":0,"hasTitle":true,"hideVizConfig":false,"hideGutter":true,"relations":[],"forms":"[]"},{"col":0,"visualizationConfig":null,"hideInIFrame":false,"selectedVisualization":"raw","title":"Check that the table contains our nested column","message":["%script","","SELECT * FROM TXN_NESTED_COL_NUM FETCH FIRST 5 ROWS ONLY"],"enabled":true,"result":null,"sizeX":0,"hideCode":false,"width":6,"hideResult":false,"dynamicFormParams":null,"row":0,"hasTitle":true,"hideVizConfig":false,"hideGutter":true,"relations":[],"forms":"[]"},{"col":0,"visualizationConfig":null,"hideInIFrame":false,"selectedVisualization":"html","title":null,"message":["%md","","### Building a Machine Learning model with a Nested Column as input"],"enabled":true,"result":null,"sizeX":0,"hideCode":true,"width":12,"hideResult":false,"dynamicFormParams":null,"row":0,"hasTitle":false,"hideVizConfig":true,"hideGutter":true,"relations":[],"forms":"[]"},{"col":0,"visualizationConfig":null,"hideInIFrame":false,"selectedVisualization":"raw","title":"Build a model using the nested column via OML4SQL","message":["%script","","BEGIN DBMS_DATA_MINING.DROP_MODEL('GLM_CLAS_USING_NESTED_COLUMN');","EXCEPTION WHEN OTHERS THEN NULL; END;","/","","DECLARE"," v_setlst DBMS_DATA_MINING.SETTING_LIST;","","BEGIN"," "," v_setlst(dbms_data_mining.algo_name) := dbms_data_mining.algo_generalized_linear_model;"," v_setlst(dbms_data_mining.prep_auto) := dbms_data_mining.prep_auto_on;"," "," "," DBMS_DATA_MINING.CREATE_MODEL2("," model_name => 'GLM_CLAS_USING_NESTED_COLUMN',"," mining_function => dbms_data_mining.classification,"," data_query => 'select * from TXN_NESTED_COL_NUM',"," set_list => v_setlst,"," case_id_column_name => 'ID',"," target_column_name => 'CLASS');","END;","/ "],"enabled":true,"result":null,"sizeX":0,"hideCode":false,"width":6,"hideResult":false,"dynamicFormParams":null,"row":0,"hasTitle":true,"hideVizConfig":false,"hideGutter":true,"relations":[],"forms":"[]"},{"col":0,"visualizationConfig":null,"hideInIFrame":false,"selectedVisualization":"table","title":"View the attributes used in building the model","message":["%sql","","SELECT attribute_name, attribute_type"," FROM user_mining_model_attributes"," WHERE model_name = 'GLM_CLAS_USING_NESTED_COLUMN'","ORDER BY attribute_name;"],"enabled":true,"result":null,"sizeX":0,"hideCode":false,"width":6,"hideResult":false,"dynamicFormParams":null,"row":0,"hasTitle":true,"hideVizConfig":false,"hideGutter":true,"relations":[],"forms":"[]"},{"col":0,"visualizationConfig":null,"hideInIFrame":false,"selectedVisualization":"table","title":"List of available model views with various model details","message":["%sql","","SELECT *"," FROM user_mining_model_views"," WHERE model_name = 'GLM_CLAS_USING_NESTED_COLUMN'"],"enabled":true,"result":null,"sizeX":0,"hideCode":false,"width":12,"hideResult":false,"dynamicFormParams":null,"row":0,"hasTitle":true,"hideVizConfig":false,"hideGutter":true,"relations":[],"forms":"[]"},{"col":0,"visualizationConfig":null,"hideInIFrame":false,"selectedVisualization":"html","title":null,"message":["%md","","## End of Script"],"enabled":true,"result":null,"sizeX":0,"hideCode":true,"width":12,"hideResult":false,"dynamicFormParams":null,"row":0,"hasTitle":false,"hideVizConfig":true,"hideGutter":true,"relations":[],"forms":"[]"},{"col":0,"visualizationConfig":null,"hideInIFrame":false,"selectedVisualization":"html","title":null,"message":["%md"],"enabled":true,"result":null,"sizeX":0,"hideCode":true,"width":12,"hideResult":true,"dynamicFormParams":null,"row":0,"hasTitle":false,"hideVizConfig":true,"hideGutter":true,"relations":[],"forms":"[]"}],"version":"6","snapshot":false,"tags":null}]